Constructing Good Quality Web Page Communities
Authors
Abstract
The World Wide Web is a rich source of information and continues to expand in size and complexity. Capturing the features of the Web at a higher level, so as to support information classification and efficient retrieval, is becoming a challenging task. One natural approach is to exploit the link information among Web pages. Previous work in this area, such as HITS, starts from a set of retrieved pages and derives a Web community, that is, a group of pages related to the query topics. Since the set of retrieved pages may contain many pages unrelated to the query topics (noise pages), the resulting Web community is sometimes unsatisfactory. In this paper, we propose an algorithm that eliminates noise pages from the set of retrieved pages and thereby improves its quality. This improvement enables existing community construction algorithms to build good-quality Web page communities. The proposed algorithm reveals and exploits the relationships among the pages concerned at a deeper level. Numerical experiment results show the effectiveness and feasibility of the algorithm. The algorithm could also be used on its own to filter out unnecessary Web pages, reducing the management cost and burden of Web-based data management systems. The ideas behind the algorithm can also be applied to other kinds of hyperlink analysis.
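For context, the HITS iteration that the abstract refers to can be sketched roughly as follows. This is a minimal illustration of the standard hub/authority update only, not the noise-elimination algorithm proposed in the paper; the function name `hits`, the `links` dictionary layout, and the fixed iteration count are assumptions made for the example.

```python
from math import sqrt


def hits(links, iterations=50):
    """Run a basic HITS iteration (illustrative sketch).

    `links` maps each page to the list of pages it links to
    (a hypothetical input format chosen for this example).
    Returns two dicts: hub scores and authority scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        new_hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Toy graph: "a" and "b" both point to "c"; "noise" is isolated.
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": [], "noise": []})
    print(max(auth, key=auth.get))
```

In this toy graph the isolated page receives no authority or hub weight, but in a real retrieved set noise pages are usually linked to the rest, which is why a separate filtering step of the kind the abstract describes can matter.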
Similar resources
Finding Web Communities by Maximum Flow Algorithm Using Well-Assigned Edge Capacities
A web community is a set of web pages that provide resources on a specific topic. Various methods for finding web communities based on link analysis have been proposed in the literature. The method proposed in this paper is based on the method using the maximum flow algorithm proposed in [7], [8]. Our objective of using the maximum flow algorithm is to extract a subgraph which can be recognized...
Classification des réponses d'un moteur de recherche et évaluation de leur pertinence
In this paper, we propose a method for evaluation of relevance in Web pages. This work joins in the general framework of Information Retrieval (IR) and more precisely, with the aim of constructing an automatic summary in encyclopaedic style. This summary type allows a new approach of Web page classification. In this paper, we present our classification approach and we detail our method for Web ...
Constructing a reliable Web graph with information on browsing behavior
Page quality estimation is one of the greatest challenges for Web search engines. Hyperlink analysis algorithms such as PageRank and TrustRank are usually adopted for this task. However, low quality, unreliable and even spam data in the Web hyperlink graph makes it increasingly difficult to estimate page quality effectively. Analyzing large-scale user browsing behavior logs, we found that a mor...
Constructing Personal Knowledge Base: Automatic Key-Phrase Extraction from Multiple-Domain Web Pages
In the paper, we proposed a general framework that could automatically extract key-phrases from a collection of web pages concerning a specific topic with the help of The Free Dictionary and then construct a personal knowledge base. Both the base and visual feature in a web page are used to calculate the weight of each candidate phrase. The system extracts top p% key-phrases for each web page b...
Discovery and Analysis of Usage Patterns for Web Personalization
In this paper, we present a community discovery method based on information extraction from user sessions in order to find usage patterns. We have characterized the overall access from each page in the web site and have detected the pertinent users within communities using the potentially useful information available in different user sessions. Then we compare the proposed method with the rando...